ABSTRACT

Both IP lookup and packet classification in IP routers can be 
implemented by some form of tree traversal. SRAM-based
pipelining can improve the throughput dramatically. However,
previous pipelining schemes result in unbalanced memory
allocation over the pipeline stages. This has been identified
as a major challenge for scalable pipelined solutions.
This paper proposes a flexible bidirectional linear pipeline 
architecture based on widely-used dual-port SRAMs. A 
search tree is partitioned, and then mapped onto pipeline 
stages by a bidirectional fine-grained mapping scheme. We 
introduce the notion of inversion factor and several heuris- 
tics to invert subtrees for memory balancing. Due to its 
linear structure, the architecture maintains packet input order
and supports non-blocking route updates. Our experiments
show that the architecture can achieve a perfectly
balanced memory distribution over the pipeline stages, for 
both trie-based IP lookup and tree-based multi-dimensional 
packet classification. For IP lookup, it can store a full back- 
bone routing table with 154419 entries using 2MB of mem- 
ory, and sustain a high throughput of 1.87 billion packets 
per second (GPPS), i.e. 0.6 Tbps for the minimum size (40 
bytes) packets. The throughput can be improved further to
2.4 Tbps by employing caching to exploit Internet
traffic locality.

Categories and Subject Descriptors 

C.1.4 [Processor Architectures]: Parallel Architectures; 
C.2.6 [Computer Communication Networks] : Internet- 
working — Routers 

General Terms 

Algorithms, Design, Performance 

Keywords 

Packet classification, IP lookup, Pipeline, Terabit, Bidirec- 
tional, SRAM 

1. INTRODUCTION 

Modern IP routers need to offer not only IP lookup for 
packet forwarding, but also a variety of value-added func- 
tions, such as security and differentiated services. Most of 
those functionalities rely on packet classification where the 
packets are classified into different flows according to some 
set of pre-defined rules. Packet classification generally refers 
to multi-field matching. IP lookup can be seen as
one-dimensional packet classification where the destination
IP address of a packet is matched against a set of prefixes. 

On the other hand, advances in optical networking technology
pose a big challenge to the design of high speed IP
routers. Increasing link rates demand that packet processing
in IP routers be performed in hardware. For instance,
40 Gbps links require processing one packet every 8 ns, i.e.
125 million packets per second (MPPS), for minimum size
(40 bytes) packets. Such throughput is impossible using existing
software-based solutions for either IP lookup [18] or
packet classification [7]. 
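
The packet rates quoted above follow directly from the link rate and the minimum packet size. The following is a minimal sketch of that arithmetic; the function names are illustrative and do not come from the paper.

```python
MIN_PACKET_BITS = 40 * 8  # minimum-size (40-byte) packet, in bits

def packets_per_second(link_bps):
    """Packet rate needed to keep a link busy with minimum-size packets."""
    return link_bps / MIN_PACKET_BITS

def line_rate_bps(pps):
    """Line rate sustained at a given packet rate with minimum-size packets."""
    return pps * MIN_PACKET_BITS

rate = packets_per_second(40e9)       # 40 Gbps link
print(rate / 1e6, "MPPS")             # 125.0 MPPS
print(1e9 / rate, "ns per packet")    # 8.0 ns per packet
print(line_rate_bps(1.87e9) / 1e12)   # roughly 0.6 Tbps at 1.87 GPPS
```

The same conversion relates the 1.87 GPPS figure reported in the abstract to its line rate of about 0.6 Tbps.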

Most hardware-based solutions for high speed packet pro- 
cessing in routers fall into two main categories: ternary con- 
tent addressable memory (TCAM)-based and dynamic/static 
random access memory (DRAM/SRAM)-based solutions. 
Although TCAM-based engines can retrieve results in just 
one clock cycle, their throughput is limited by the relatively 
low clock rate of TCAMs. TCAMs are expensive and offer 
little flexibility to adapt to new addressing and routing protocols
[10]. As shown in Table 1, SRAMs offer better scalability
than TCAMs with respect to memory access time,
density and power consumption. 

Table 1: Comparison of TCAM and SRAM technologies
(based on 18 Mbit chip)

                                   TCAM           SRAM
Maximum clock rate (MHz)           266 [16]       400 [5, 19]
Cell size (# of transistors/bit)   16             6
Power consumption (Watts)          12 ~ 15 [24]   ≈ 0.1 [4]



In DRAM/SRAM-based solutions, both IP lookup and 
packet classification can be implemented by some form of 
tree traversal. Each packet traverses a search tree in the 
memory, and retrieves its matched entry when it arrives at 
a tree leaf. Such a search process needs multiple memory
accesses, which becomes a major drawback of traditional
DRAM/SRAM-based solutions. Several researchers
have explored pipelining to improve the throughput. A sim- 
ple pipelining approach is to map each tree level onto a 
pipeline stage with its own memory and processing logic. 
One packet can be processed every clock cycle. However, 
this approach results in unbalanced tree node distribution 
over the pipeline stages. This has been identified as a dom- 
inant issue for pipelined architectures [3]. In an unbalanced 
pipeline, the "fattest" stage, which stores the largest number
of tree nodes, becomes a bottleneck. It adversely affects
the overall performance of the pipeline in the following aspects.
First, more time is needed to access the larger local
memory. This leads to a reduction in the global clock rate.
Second, a fat stage results in many updates, due to the proportional
relationship between the number of updates and
the number of tree nodes stored in that stage. Particularly
during the update process caused by intensive route/rule
insertion, the fattest stage may also result in memory overflow.
Furthermore, since it is unclear at hardware design
time which stage will be the fattest, we need to allocate
memory with the maximum size for each stage. Such
over-provisioning results in memory wastage [1] and excessive
power consumption.

To balance the memory distribution across stages, several
novel pipeline architectures have been proposed [1, 12, 9].
However, none of them can achieve a perfectly balanced
memory distribution over stages. Furthermore, due to the 
non-linear structures they employ, most of them must stall 
subsequent packets during a route update. 

We propose an SRAM-based bidirectional linear pipeline
architecture, for both IP lookup and packet classification in 
IP routers. This paper makes the following contributions: 

• To the best of our knowledge, this work is the first 
one to achieve a perfectly balanced memory allocation 
over pipeline stages, for both IP lookup and multi- 
dimensional packet classification. The memory wastage 
due to over-provisioning is almost zero. 

• A bidirectional fine-grained mapping scheme is pro- 
posed to realize the above goal. We introduce the no- 
tion of inversion factor and several heuristics to invert 
the subtrees for memory balancing. 

• A novel bidirectional linear pipeline architecture is pre- 
sented to enable the above mapping. It maintains the 
packet input order and supports non-blocking updates. 

• Our simulation experiments using real-life data demon- 
strate the SRAM-based pipelined architecture to be a 
promising solution for next generation IP routers. The 
proposed architecture can store a full backbone rout- 
ing table with 154419 entries using 2MB of memory. It 
can sustain a high throughput of 1.87 billion packets 
per second (GPPS), i.e. 0.6 Tbps for minimum size 
(40 bytes) packets. 

The remainder of this paper is organized as follows. Section 2
reviews the background and related work. Section 3
discusses memory balancing over pipeline stages and
proposes a novel bidirectional fine-grained mapping scheme.
Section 4 proposes a corresponding bidirectional linear pipeline
architecture. Section 5 conducts simulation experiments to
evaluate the performance of our approaches. Section 6 concludes
the paper.

2. BACKGROUND 

2.1 IP Lookup and Packet Classification 

2.1.1 Trie-based IP Lookup 

The nature of IP lookup is longest prefix matching (LPM). 
The most common data structure in algorithmic solutions 
for performing LPM is some form of trie [H]. A trie is a
binary tree, where a prefix is represented by a node. The
value of the prefix corresponds to the path from the root of 
the tree to the node representing the prefix. The branch- 
ing decisions are made based on the consecutive bits in the 
prefix. A trie is called a uni-bit trie if only one bit is used 
for making a branching decision at a time. The prefix set
in Figure 1(a) corresponds to the uni-bit trie in Figure 1(b).
For example, the prefix "010*" corresponds to the path
starting at the root and ending in node P3: first a left-turn
(0), then a right-turn (1), and finally a turn to the left (0).
Each trie node contains two fields: the represented prefix
and the pointer to the child nodes. By using the optimization
called leaf-pushing [21], each node needs only one field:
either the pointer to the next-hop address or the pointer to
the child nodes. Figure 1(c) shows the leaf-pushed uni-bit
trie derived from Figure 1(b).

Figure 1: (a) Prefix set; (b) Uni-bit trie; (c) Leaf-pushed
uni-bit trie.

Given a leaf-pushed uni-bit trie, IP lookup is performed 
by traversing the trie according to the bits in the IP address. 
When a leaf is reached, the prefix associated with the leaf is 
the longest matched prefix for that IP address. The time to 
look up a uni-bit trie is equal to the prefix length. The use of 
multiple bits in one scan can increase the search speed. Such 
a trie is called a multi-bit trie. The number of bits scanned 
at a time is called the stride. Some optimization schemes [5, 11]
have been proposed to build a memory-efficient multi-bit 
trie. For simplicity, we consider only the leaf-pushed uni-bit 
trie in this paper, though our ideas are applicable to other 
forms of tries. 
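
As an illustration of the lookup just described, the following is a minimal sketch of a uni-bit trie with leaf pushing. The prefix set is hypothetical (it is not the set of Figure 1, apart from reusing the prefix "010*"), and the class and function names are assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.child = [None, None]  # child[0] = left (bit 0), child[1] = right (bit 1)
        self.hop = None            # next-hop information, if any

def build_trie(prefixes):
    """Build a uni-bit trie from a dict mapping bit strings to next hops."""
    root = TrieNode()
    for bits, hop in prefixes.items():
        node = root
        for b in bits:
            i = int(b)
            if node.child[i] is None:
                node.child[i] = TrieNode()
            node = node.child[i]
        node.hop = hop
    return root

def leaf_push(node, inherited=None):
    """Push next hops down so that only leaves carry them."""
    hop = node.hop if node.hop is not None else inherited
    if node.child[0] is None and node.child[1] is None:
        node.hop = hop
        return
    for i in (0, 1):
        if node.child[i] is None:
            node.child[i] = TrieNode()  # make every internal node full
        leaf_push(node.child[i], hop)
    node.hop = None

def lookup(root, addr_bits):
    """Walk the leaf-pushed trie; the leaf reached holds the longest match."""
    node = root
    for b in addr_bits:
        if node.child[0] is None:  # leaf reached
            break
        node = node.child[int(b)]
    return node.hop

root = build_trie({"0": "P1", "010": "P3", "11": "P4"})
leaf_push(root)
print(lookup(root, "0101"))  # P3
print(lookup(root, "0111"))  # P1
print(lookup(root, "1000"))  # None (no matching prefix)
```

After leaf pushing every internal node has two children, so the lookup loop only needs to test whether the current node is a leaf.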

2.1.2 Decision Tree based Packet Classification 

Multi-dimensional packet classification is one of the fundamental
challenges in designing high speed routers. It enables
routers to support firewall processing, Quality of Service
differentiation, virtual private networks, policy routing,
and other value added services. An IP packet can be classi- 
fied based on a number of fields in the packet header, such 
as source/destination IP address, source/destination port 
number, type of service, type of protocol, etc. Fields are 
generally specified by range. When a packet arrives at a 
router, its header is compared against a set of rules. Each 
rule can have one or more fields and their associated values, 
a priority, and an action to be taken if matched. A packet 
is considered matching a rule only if it matches all the fields 
within that rule. 

Many packet classification algorithms are based on de- 
cision trees which take the geometric view of the packet 
classification problem. HyperCuts [20] is a representative
of such algorithms. At each node of the decision tree, the
search space is cut based on the information from one or
more fields in the rule. The HyperCuts algorithm allows cutting
on multiple fields per step, resulting in a fatter and shorter 
decision tree. 

The searching algorithm in a HyperCuts tree is simple. 
When a packet header arrives at the root of the tree, it will 
traverse the decision tree until it finds either a leaf node or a 
NULL node. The leaf node will represent the first matching 
rule, and the NULL node will indicate that no match has 
been found. 

2.2 Memory-Balanced Pipelines 

Pipelining can dramatically improve the throughput of 
tree traversal. A straightforward way to pipeline a tree is to 
assign each tree level to a different stage, so that a packet 
can be processed every clock cycle. However, as discussed 
earlier, this simple pipelining scheme results in unbalanced 
memory distribution, leading to low throughput and ineffi- 
cient memory allocation. 

Basu et al. [3] and Kim et al. [11] both reduce the memory 
imbalance by using variable strides to minimize the largest 
trie level. However, even with their schemes, the size of 
the memory of different stages can have a large variation. 
As an improvement upon [11], Lu et al. [13] propose a tree-packing
heuristic to balance the memory further, but it does
not solve the fundamental problem of how to retrieve a
node's descendants which are not allocated in the following
stage. Furthermore, a variable-stride multi-bit trie is difficult
to implement in hardware, especially if incremental
updating is needed [3].

Baboescu et al. [1] propose a Ring pipeline architecture
for tree-based search engines. The pipeline stages are con- 
figured in a circular, multi-point access pipeline so that the 
search can be initiated at any stage. A tree is split into many 
small subtrees of equal size. These subtrees are then mapped 
to different stages to create an almost balanced pipeline. 
Some subtrees must wrap around if their roots are mapped 
to the last several stages. Any incoming IP packet needs 
to look up an index table to find its corresponding subtree's
root which is the starting point of that search. Since the 
subtrees may be from different depths of the original tree, 
we cannot use a constant number of address bits to index 
the table. Thus, the index table must be built by content 
addressable memories (CAMs), which may result in lower 
speed. Though all IP packets enter the pipeline from the 
first stage, their lookup processes may be activated at differ- 
ent stages. All the packets must traverse the pipeline twice 
to complete the tree traversal. The throughput is thus 0.5
packets per clock cycle.

Kumar et al. [12] extend the circular pipeline with a 
new architecture called Circular, Adaptive and Monotonic 
Pipeline (CAMP). It uses several initial bits (i.e. initial 
stride) as the hashing index to partition the tree. Using an
idea similar to Ring [1] but a different mapping algorithm,
CAMP creates an almost balanced pipeline as well. Unlike
the Ring pipeline, CAMP has multiple entry stages and exit 
stages. To manage the access conflicts between packets from 
current and preceding stages, several queues are employed. 
Since different packets of an input stream may have differ- 
ent entry and exit stages, the ordering of the packet stream 
is lost when passing through CAMP. Assuming the packets 
traverse all the stages, when the packet arrival rate exceeds 
0.8 packets per clock cycle, some packets may be discarded 
[12]. In other words, the worst-case throughput is 0.8 packets
per clock cycle. Also in CAMP, a queue adds extra delay
for each packet, which may result in out-of-order output and 
delay variation. 

Due to the non-linear structure, neither the Ring pipeline 
nor CAMP in the worst case can maintain a throughput of 
one packet per clock cycle. Also, neither of them supports 
the non-blocking route update, since the ongoing update 
may conflict with the preceding or following packets. Our 
previous work [9] adopts an optimized linear pipeline architecture,
named OLP, to achieve a high throughput of one
output per clock cycle, while supporting write bubbles [3] 
for non-blocking update. By adding nops (no-operations) 
in the pipeline, OLP offers more freedom in mapping tree 
nodes to pipeline stages. The tree is partitioned, and all 
subtrees are converted into queues and are mapped onto the 
pipeline from the first stage. However, in OLP, the first sev- 
eral stages may not be balanced, since the top levels of a 
tree have few nodes. 

2.3 Discussion 

State-of-the-art techniques cannot achieve perfectly bal- 
anced memory distribution, due to several constraints they 
place during mapping: (1) They require the tree nodes on 
the same level be mapped onto the same stage. (2) The 
mapping scheme is uni-directional: the subtrees partitioned 
from the original tree must be mapped in the same direc- 
tion (either from the root, or from the leaves). Actually, 
both constraints are unnecessary. The only constraint we 
must obey is: 

Constraint 1 : If node A is an ancestor of node B in a tree, 
then A must be mapped to a stage preceding the stage to 
which B is mapped. 

This paper proposes a flexible bidirectional linear pipeline 
architecture which provides a unified architecture for both
IP lookup and packet classification. By employing widely- 
used dual-port SRAMs, the architecture allows two flows 
from opposite directions to access the local memory in a 
stage at the same time. With a bidirectional fine-grained 
mapping scheme, a perfectly balanced memory distribution 
over pipeline stages is achieved. The architecture has many desirable
properties due to its linear structure: (1) the worst-case throughput
of one packet per clock cycle is sustained; (2) each packet
has a constant delay to go through the architecture; (3) the 
packet input order is maintained; (4) non-blocking update is 
supported, that is, while a write bubble is inserted to update 
the stages, both the subsequent and the antecedent packets 
can perform the search as well. 

Figure 2: Level-by-level mapping of routing tries onto 32 pipeline stages.

Figure 3: Level-by-level mapping of decision trees onto 25 pipeline stages.



3. MEMORY BALANCING 

This section studies the problem of balancing the memory 
distribution across pipeline stages. We examine two typical 
mapping approaches, and then propose a novel bidirectional 
fine-grained mapping scheme. First, we define the following 
terms. 



Definition 1. The pipeline depth is the number of pipeline
stages.



Definition 2. The depth of a tree node is the directed 
distance from the tree node to the tree root. The depth of a 
tree refers to the maximum depth of all tree leaves. 

Definition 3. The height of a tree node is the maximum 
directed distance from the tree node to a leaf node. The 
height of a tree refers to the height of the root. In fact the 
depth of a tree is equal to its height. 



Definition 4. In depth-based (height-based) mapping, two 
tree nodes are said to be on the same level if they have the 
same depth (height). 



3.1 Motivation 

The most straightforward mapping scheme is depth-based 
mapping [3], where the tree nodes with the same depth are 
mapped onto the same stage. In this scheme, the first stage 
always has only one tree node, i.e. the tree root. All the packets
enter the pipeline from the first stage. Another level-by-level
mapping scheme is height-based mapping [8], where
the tree nodes with the same height are mapped onto the 
same stage. In this scheme, all tree leaves are mapped onto 
the first stage, and the tree root is mapped onto the last 
stage. All the packets enter the pipeline from the last stage. 

We study the effectiveness of the above two mapping schemes
by conducting experiments on four representative routing tables,
rrc00, rrc01, rrc08 and rrc11, collected from [17]. We
also collected four rule sets, fw1_100, ipc1_1k, acl1_10k and
fw1_real, from [15] and built them into decision trees using
the HyperCuts algorithm [20]. According to Figures 2(a-b)
and Figures 3(a-b), for both trie-based IP lookup and decision
tree based packet classification, the node distribution
across the stages is extremely unbalanced after using either
the depth-based or the height-based mapping.
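
To make the two level-by-level schemes concrete, the sketch below counts how many nodes each scheme would place on every stage of a toy tree. The nested-list tree encoding and the function name are assumptions made for this example only; they are not taken from the paper.

```python
from collections import Counter

def level_histograms(tree):
    """Return (nodes per depth, nodes per height) for a tree given as nested lists.

    A node is the list of its children, so [] is a leaf and [[], []] is a
    root with two leaf children.
    """
    by_depth, by_height = Counter(), Counter()

    def visit(node, depth):
        # Height of a leaf is 0; otherwise 1 + the tallest child.
        height = 1 + max((visit(child, depth + 1) for child in node), default=-1)
        by_depth[depth] += 1
        by_height[height] += 1
        return height

    visit(tree, 0)
    return dict(by_depth), dict(by_height)

# Root with one internal child (two leaves below it) and one leaf child.
depth_hist, height_hist = level_histograms([[[], []], []])
print(depth_hist)   # depth-based stage loads:  {0: 1, 1: 2, 2: 2} (key order may differ)
print(height_hist)  # height-based stage loads: {0: 3, 1: 1, 2: 1} (key order may differ)
```

Even on this five-node tree the root-side stage holds a single node under depth-based mapping, while the leaf-side stage is the fattest under height-based mapping.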

Figure 4: Bidirectional fine-grained mapping for the trie in Figure 1.



We say the depth-based mapping is forward mapping since 
the mapping is begun from the root, while the height-based 
mapping is reverse mapping since it is begun from the leaves. 
Intuitively, the node distribution will be more balanced if the two mapping
schemes can be combined in an effective way.

3.2 Bidirectional Fine-grained Mapping 

To achieve a perfectly balanced memory distribution over 
the stages, we propose a bidirectional fine-grained mapping 
scheme¹, as shown in Figure 4. The main ideas are (1) fine-grained
mapping: allowing two trie nodes on the same
trie level to be mapped onto different stages; and (2) bidirectional
mapping: allowing two subtrees to be mapped
in different directions. However, there are several issues
to be addressed: 

• Partition the entire tree so that we can have several 
subtrees to be mapped in different directions. 

• Decide which subtree(s) should be inverted and mapped
in the reverse direction.

• Adapt and combine the depth-based and the height- 
based mapping schemes at each step. 

3.2.1 Tree Partitioning 

We use prefix expansion [21] to partition the tree. Several 
initial bits are used as the index to partition the tree into 
many disjoint subtrees. According to [2] and our observation
of the collected routing tables, few prefixes in real-life
routing tables are shorter than 16 bits. Hence, there will be little
prefix duplication when we use fewer than 16 initial bits to 
expand the prefixes. 
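
The partitioning step can be sketched as follows. This is an illustrative implementation under stated assumptions: prefixes are given as bit strings mapped to next hops, the function name is hypothetical, and when an expanded short prefix collides with a longer original prefix the longer one wins, which is the usual longest-prefix-match convention.

```python
from itertools import product

def partition_by_initial_bits(prefixes, k):
    """Split a prefix set into subtrees indexed by the first k bits.

    Prefixes shorter than k bits are expanded to every k-bit value they
    cover. Each subtree maps the remaining suffix bits to a next hop.
    """
    subtrees = {}
    # Process short prefixes first so longer ones overwrite their expansions.
    for prefix, hop in sorted(prefixes.items(), key=lambda kv: len(kv[0])):
        if len(prefix) >= k:
            subtrees.setdefault(prefix[:k], {})[prefix[k:]] = hop
        else:
            for tail in product("01", repeat=k - len(prefix)):
                subtrees.setdefault(prefix + "".join(tail), {})[""] = hop
    return subtrees

table = {"0": "A", "01": "B", "0110": "C", "111": "D"}
parts = partition_by_initial_bits(table, k=2)
for index in sorted(parts):
    print(index, parts[index])
# 00 {'': 'A'}
# 01 {'': 'B', '10': 'C'}
# 11 {'1': 'D'}
```

Only the single prefix shorter than k bits is duplicated here, which mirrors the observation that few real-life prefixes are shorter than the chosen number of initial bits.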

3.2.2 Subtree Inversion 

In a trie, there are few nodes at the top levels while there 
are a lot of nodes at the leaf level. Hence, we can invert 
some subtrees so that their leaf nodes are mapped onto the 
first several stages. We propose several heuristics to select 
the subtrees to be inverted: 

1. Largest leaf: The subtree with the largest number of
leaves is preferred. This is straightforward since we
need enough nodes to be mapped onto the first several
stages.

2. Least height: The subtree of shortest height is preferred.
Due to Constraint 1, a subtree with a larger
height has less flexibility to be mapped onto pipeline
stages.

3. Largest leaf per height: This is a combination of the
previous two heuristics, by dividing the number of
leaves of a subtree by its height.

4. Least average depth per leaf: Average depth per leaf
is the ratio of the sum of the depth of all the leaves
to the number of leaves. This heuristic prefers a more
balanced subtree.

¹ For simplicity, in this section we describe our scheme for
the trie only. The scheme can be easily extended to the
decision tree.

Algorithm 1 finds the subtrees to be inverted, where IFR
denotes the inversion factor. A larger inversion factor results
in more subtrees being inverted. When the inversion
factor is 0, no subtree is inverted. When the inversion factor
is close to the pipeline depth, all subtrees are inverted. The
complexity of this algorithm is O(K), where K denotes the
total number of subtrees.

Algorithm 1 Selecting the subtrees to be inverted
Input: K subtrees.
Output: V subtrees to be inverted.
1: N = total # of tree nodes of all subtrees, H = # of pipeline stages, V = 0, W = K.
2: while V < K AND W < IFR x ⌈N/H⌉ do
3:   Based on the chosen heuristic, select one subtree from those not inverted.
4:   V = V + 1, W = W - 1 + # of leaves of the selected subtree.
5: end while
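
A possible rendering of Algorithm 1 is sketched below. The dictionary-based subtree summaries, the function name, and the guard against a zero height in the third heuristic are assumptions made for this example. The sketch sorts the candidates once for brevity, which costs O(K log K) rather than the O(K) stated for the algorithm.

```python
import math

# Each heuristic returns a score; the subtree with the highest score is picked first.
HEURISTICS = {
    "largest_leaf": lambda t: t["leaves"],
    "least_height": lambda t: -t["height"],
    "largest_leaf_per_height": lambda t: t["leaves"] / max(t["height"], 1),
    "least_avg_leaf_depth": lambda t: -t["leaf_depth_sum"] / t["leaves"],
}

def select_inverted(subtrees, num_stages, inversion_factor, heuristic="largest_leaf"):
    """Pick subtrees to invert until W reaches IFR * ceil(N / H)."""
    total_nodes = sum(t["nodes"] for t in subtrees)
    target = inversion_factor * math.ceil(total_nodes / num_stages)
    candidates = sorted(subtrees, key=HEURISTICS[heuristic], reverse=True)
    inverted, w = [], len(subtrees)  # W starts at K: one root per subtree
    while candidates and w < target:
        chosen = candidates.pop(0)
        inverted.append(chosen)
        w += chosen["leaves"] - 1    # the root leaves the first stage, the leaves join it
    return inverted

subtrees = [
    {"name": "A", "nodes": 7, "leaves": 4, "height": 2, "leaf_depth_sum": 8},
    {"name": "B", "nodes": 3, "leaves": 2, "height": 1, "leaf_depth_sum": 2},
    {"name": "C", "nodes": 1, "leaves": 1, "height": 0, "leaf_depth_sum": 0},
]
print([t["name"] for t in select_inverted(subtrees, 4, 0)])  # []
print([t["name"] for t in select_inverted(subtrees, 4, 2)])  # ['A']
```

With an inversion factor of 0 nothing is inverted, and raising the factor inverts more subtrees, matching the behaviour described for IFR.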



3.2.3 Mapping Algorithm

Now we have two sets of subtrees. Those subtrees which 
are mapped from roots are called the forward subtrees, while 
the others are called the reverse subtrees. We use a bidirectional
fine-grained mapping algorithm (Algorithm 2). The
nodes are popped out of the ReadyList in the decreasing
order of their priority. The priority of a trie node is defined
as its height if the node is in a forward subtree, and
its depth if in a reverse subtree. The node whose priority is
equal to the number of the remaining stages is regarded as
a critical node. If such a node is not mapped onto the current
stage, some of its descendants (if in a forward subtree)
or ancestors (if in a reverse subtree) cannot be mapped
later. For the forward subtrees, a node is pushed into the
NextReadyList immediately after its parent is popped. For
the reverse subtrees, a node will not be pushed into the
NextReadyList until all its children are popped. The complexity
of this mapping algorithm is O(HN), where H denotes
the pipeline depth and N the total number of trie
nodes.

Algorithm 2 Bidirectional fine-grained mapping
Input: K forward subtrees and V reverse subtrees.
Output: H stages with mapped nodes.
1: Create and initialize two lists: ReadyList = ∅ and NextReadyList = ∅.
2: Rn = # of remaining nodes, Rh = # of remaining stages = H.
3: Push the roots of the forward subtrees and the leaves of the reverse subtrees into ReadyList.
4: for i = 1 to H do
5:   Mi = 0, Critical = FALSE.
6:   Sort the nodes in ReadyList in the decreasing order of the node priority.
7:   while Critical = TRUE or (Mi < ⌈Rn/Rh⌉ and ReadyList ≠ ∅) do
8:     Pop node from ReadyList and map into Stage i.
9:     if The node is in forward subtrees then
10:      The popped node's children are pushed into NextReadyList.
11:    else if All children of the popped node's parent have been mapped then
12:      The popped node's parent is pushed into NextReadyList.
13:    end if
14:    Critical = FALSE.
15:    if There exists a node Nc ∈ ReadyList and the priority of Nc >= Rh - 1 then
16:      Critical = TRUE.
17:    end if
18:  end while
19:  Rn = Rn - Mi, Rh = Rh - 1.
20:  Merge the NextReadyList to the ReadyList.
21: end for
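
The following sketch is one way to express Algorithm 2 in code. The Node class, the nested-list tree encoding, and all function names are assumptions made for this illustration. It also assumes the pipeline has at least as many stages as the tallest subtree needs; otherwise some nodes stay unmapped.

```python
import math

class Node:
    def __init__(self, forward, parent=None, depth=0):
        self.forward, self.parent, self.depth = forward, parent, depth
        self.children, self.height, self.stage = [], 0, 0  # stage 0 = unmapped

def build(shape, forward, parent=None, depth=0):
    """Build a subtree from nested lists, e.g. [[], []] is a root with two leaves."""
    node = Node(forward, parent, depth)
    node.children = [build(s, forward, node, depth + 1) for s in shape]
    node.height = 1 + max((c.height for c in node.children), default=-1)
    return node

def all_nodes(root):
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out

def priority(node):
    # Height for forward subtrees, depth for reverse subtrees.
    return node.height if node.forward else node.depth

def map_bidirectional(roots, num_stages):
    nodes = [n for r in roots for n in all_nodes(r)]
    remaining_nodes, remaining_stages = len(nodes), num_stages
    # Forward roots and reverse leaves are ready first.
    ready = [n for n in nodes
             if (n.parent is None if n.forward else not n.children)]
    stages = []
    for i in range(1, num_stages + 1):
        ready.sort(key=priority, reverse=True)
        quota = math.ceil(remaining_nodes / remaining_stages)
        mapped, next_ready = [], []
        # Keep popping while under quota or while the head node is critical.
        while ready and (len(mapped) < quota
                         or priority(ready[0]) >= remaining_stages - 1):
            node = ready.pop(0)
            node.stage = i
            mapped.append(node)
            if node.forward:
                next_ready.extend(node.children)
            elif node.parent and all(c.stage for c in node.parent.children):
                next_ready.append(node.parent)
        stages.append(mapped)
        remaining_nodes -= len(mapped)
        remaining_stages -= 1
        ready.extend(next_ready)
    return stages

def satisfies_constraint_1(nodes):
    """Every node is mapped, and parents come first in the traversal direction."""
    return all(n.stage > 0 and (n.parent is None or
               (n.parent.stage < n.stage if n.forward else n.parent.stage > n.stage))
               for n in nodes)

shape = [[[], []], [[], []]]  # 7-node complete binary tree
roots = [build(shape, forward=True), build(shape, forward=False)]
stages = map_bidirectional(roots, num_stages=4)
print([len(s) for s in stages])  # [4, 4, 3, 3]
print(satisfies_constraint_1([n for r in roots for n in all_nodes(r)]))  # True
```

For one forward and one reverse copy of the same 7-node tree, the 14 nodes spread over four stages as 4, 4, 3, 3, whereas a purely depth-based mapping of both trees would need only three stages and load them as 2, 4, 8.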



The effectiveness of the bidirectional mapping scheme is
evaluated in Section 5.

4. HARDWARE ARCHITECTURE 

To enable the bidirectional fine-grained mapping scheme, 
we develop a bidirectional linear pipeline architecture based 
on dual-port SRAMs², as shown in Figure 5.

4.1 Overview 

There is one Direction Index Table (DIT), which stores 
the relationship between the subtrees and their mapping di- 
rections: forward or reverse. For any arriving packet p, the 
initial bits of its IP address are used to look up the DIT
and retrieve the information about its corresponding subtree
ST(p). The information includes (1) the distance to
the stage where the root of ST(p) is stored, (2) the memory
address of the root of ST(p) in that stage, and (3) the
mapping direction of ST(p), which leads the packet to a different
entrance of the pipeline. For example, in Figure 5, if
the mapping direction is forward, the packet is sent to the
leftmost stage of the pipeline. Otherwise, the packet is sent
to the rightmost stage.

² Dual-port SRAMs have been standard components in many
devices such as FPGAs [22].

Once its direction is known, the packet will go through 
the entire pipeline in that direction. The pipeline is con- 
figured as a dual-entrance bidirectional linear pipeline. At 
each stage, the memory has dual Read/Write ports so that 
the packets from both directions can access the memory si- 
multaneously. The content of each entry in the memory 
includes (1) the memory address of the child node and (2) 
the distance to the stage where the child node is stored. If 
the distance value is zero, the memory address of its child 
node will be used to index the memory in the next stage 
to retrieve the child node content. Otherwise, the packet 
will pass that stage without any operation but decrement 
its distance value by one. 
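
The stage-by-stage behaviour can be mimicked in software as follows. This is only a sketch under assumptions that go beyond the text: each internal entry is modelled as holding one (address, distance) pair per branch bit, leaf entries hold a next hop directly, and the `traverse` function and its parameters are hypothetical names.

```python
def traverse(stages, direction, root_addr, root_distance, bits):
    """Send one packet through the pipeline and return the matched next hop.

    stages        : list of per-stage memories (lists of entries)
    direction     : "forward" enters at the leftmost stage, "reverse" at the rightmost
    root_addr     : address of the subtree root, as returned by the DIT
    root_distance : number of stages to skip before the root's stage
    bits          : remaining address bits used for branching
    """
    order = stages if direction == "forward" else stages[::-1]
    addr, distance, bit_pos = root_addr, root_distance, 0
    result, done = None, False
    for memory in order:
        if done:
            continue              # the packet still flows through every stage
        if distance > 0:
            distance -= 1         # no-op stage: only the distance is decremented
            continue
        entry = memory[addr]
        if "hop" in entry:        # leaf entry: the search ends here
            result, done = entry["hop"], True
        else:                     # internal entry: follow the branch for this bit
            addr, distance = entry["next"][int(bits[bit_pos])]
            bit_pos += 1
    return result

stage0 = [{"next": [(0, 1), (0, 0)]}]                # root of a forward subtree
stage1 = [{"hop": "B"}, {"hop": "C"}]
stage2 = [{"hop": "A"}, {"next": [(1, 0), (1, 0)]}]  # entry 1: root of a reverse subtree
stages = [stage0, stage1, stage2]

print(traverse(stages, "forward", 0, 0, "0"))  # A (skips stage 1 as a no-op)
print(traverse(stages, "forward", 0, 0, "1"))  # B
print(traverse(stages, "reverse", 1, 0, "0"))  # C
```

Because every packet visits all stages in its direction regardless of where its search ends, the delay through the pipeline stays constant.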

4.2 Incremental Route Updates 

We update the memory in the pipeline by inserting write 
bubbles [3]. The new content of the memory is computed offline.
When an update is initiated, a write bubble is inserted
into the pipeline. The direction of write bubble insertion is 
determined by the direction of the subtree that the write 
bubble is going to update. Each write bubble is assigned an 
ID. There is one write bubble table in each stage. It stores 
the update information associated with the write bubble ID. 
When it arrives at the stage prior to the stage to be updated, 
the write bubble uses its ID to look up the write bubble table.
Then it retrieves (1) the memory address to be updated
in the next stage, (2) the new content for that memory lo- 
cation, and (3) a write enable bit. If the write enable bit is 
set, the write bubble will use the new content to update the 
memory location in the next stage. 

Since the subtrees mapped onto the two directions are 
disjoint, a write bubble inserted from one direction will not 
contaminate the memory content for the search from the 
other direction. Also, since the pipeline is linear, all packets 
preceding or following the write bubble can perform their 
searches while the write bubble performs an update. 

4.3 Throughput Improvement by Caching 

In the above architecture shown in Figure 5, at most two
packets are allowed to enter the pipeline at the same time.
The throughput can be 2 packets per clock cycle (PPC)
only if the two packets are in the two distinct directions.
Such traffic balancing usually cannot be guaranteed in
practice. Thus, the throughput is lower than 2 PPC when we
insert 2 packets in one clock cycle.

On the other hand, Internet traffic contains a great amount 
of locality due to the TCP mechanism and application characteristics
[10]. As shown in Figure 6, some small caches can
be added into the architecture to exploit the Internet traffic
locality. The most recently searched packets are cached.
Any arriving packet accesses the cache first. If a cache hit 
happens, the packet will skip traversing the pipeline. Oth- 
erwise, it needs to go through the pipeline. The cache can 
be organized with any associativity. We use full associativity
as the default. For IP lookup, only the destination IP of 
the packet is used to index the cache, while for packet clas- 
sification, multiple fields of the packet header may be used. 
The cache update will be triggered, either when there is a 
route update that is related to some cached entry, or after a 
packet that previously had a cache miss retrieves its search 
result from the pipeline. Any replacement algorithm can be 

Figure 5: Block diagram of the basic architecture.

algorithm is used as the default.

Figure 6: Block diagram of the enhanced architecture.
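
A small software model of the caching idea is sketched below. It is illustrative only: the class name and the callback are hypothetical, and least-recently-used replacement is an assumption chosen for the example, since any replacement algorithm can be used.

```python
from collections import OrderedDict

class LookupCache:
    """Fully associative cache placed in front of the pipeline (LRU assumed)."""

    def __init__(self, capacity, pipeline_lookup):
        self.capacity = capacity
        self.pipeline_lookup = pipeline_lookup  # called only on a cache miss
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, dst_ip):
        if dst_ip in self.entries:
            self.entries.move_to_end(dst_ip)    # mark as most recently used
            self.hits += 1
            return self.entries[dst_ip]
        self.misses += 1
        result = self.pipeline_lookup(dst_ip)   # the packet traverses the pipeline
        self.entries[dst_ip] = result           # cache update after the miss
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        return result

    def invalidate(self, dst_ip):
        """Drop a cached entry affected by a route update."""
        self.entries.pop(dst_ip, None)

cache = LookupCache(capacity=2, pipeline_lookup=lambda ip: "hop-for-" + ip)
for ip in ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"]:
    cache.lookup(ip)
print(cache.hits, cache.misses)  # 1 4
```

Only misses consume a pipeline slot, so the higher the hit rate the more packets can be accepted per clock cycle.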



5. PERFORMANCE EVALUATION 

This section evaluates the effectiveness of the proposed 
scheme and the performance of the proposed architecture. 
First, we examine the memory balancing achieved by the bidirectional
fine-grained mapping scheme. Then, we measure
the throughput using real-life traffic traces. All experiments 
are based on simulation. 

5.1 Memory Balancing 

First, we conducted the experiments on the four routing
tables given in Section 3.1 to examine the effectiveness of the
bidirectional fine-grained mapping scheme. We used various
inversion heuristics and inversion factors to evaluate their
impacts. In these experiments, the number of initial bits
used for partitioning the trie is 12. Then, with appropriate 
parameter setting, we conducted the experiments on the four 
5-tuple rule sets to verify the effectiveness of our scheme for 
decision trees. 

5.1.1 Impact of the inversion heuristics 

As discussed in Section 3.2.2, we have four different heuristics
to invert subtrees. Now we examine their performance
and obtain the results shown in Figure 7. The value of the
inversion factor is set to 1. According to the results, the least
average depth per leaf heuristic has the best performance.
It shows that, when we have a choice, a balanced subtree
should be inverted. This is because a balanced
subtree has many nodes not only at the leaf level but also
at the lower levels, which can help balance not only the first
stage but also the first several stages.

5.1.2 Impact of the inversion factor 

Using the largest leaf heuristic, we changed the value of 
the inversion factor. The results are shown in Figure 8.
When the inversion factor is 0, the bidirectional mapping
becomes fine-grained forward mapping only. The mapping
becomes fine-grained reverse mapping when the inversion
factor is close to the pipeline depth so that all subtrees are 
inverted. 

5.1.3 Short Summary 

According to the above results for trie-based IP lookup, 
the bidirectional fine-grained mapping scheme can achieve 
a perfectly balanced memory distribution over the pipeline 
stages, by either using an appropriate inversion heuristic or 
adopting an appropriate inversion factor. This also shows 
that the architecture is flexible, in that it offers a large design 
space for adapting to different routing tables with various 
prefix distributions. In fact, we conducted more experiments 
on 16 routing tables collected from [17] and obtained results 
similar to those presented here. 

5.1.4 Applying onto Decision Trees 

Figure 9: Bidirectional fine-grained mapping for decision trees. (Largest leaf heuristic; Inversion factor = 1) 

Figure 7: Bidirectional fine-grained mapping with different heuristics. (Inversion factor = 1) 



Comparing Figures 2(a-b) with Figures 3(a-b), we find that 
the characteristics of the decision trees are distinct from 
those of the routing tries: the node distribution of the 
decision trees after depth-based mapping differs from that 
of the routing tries. There are many nodes in the first 
several stages, so only a few subtrees can be inverted in the 
bidirectional mapping. Also, in HyperCuts, each step from a 
node to its children is a multi-dimensional cut rather than a 
bit scan. Hence, we cannot use prefix expansion to partition 
the tree. Instead, we use only the first cut to partition the 
tree. Figure 9 shows the results for the bidirectional mapping 
of the decision trees. Compared to Figure 7(a), which uses the 
same setting but does not achieve a balanced distribution, 
Figure 9 exhibits a perfectly balanced node distribution over 
stages. 

Table 2: IP header traces 

5.2 Throughput Improvement 

We used real-life Internet traffic traces to evaluate the 
throughput performance of the proposed architecture. Two 
anonymized real-life traces were collected from [14]. Their 
information is listed in Table 2. Due to the unavailability of 
public IP traces associated with their corresponding routing 
tables, we generated the routing tables by extracting the 
unique destination IP addresses from the traces. 
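
The table-generation step can be sketched as follows. This is a minimal illustration that assumes the trace has already been parsed into a sequence of dotted-quad destination addresses and that each unique address becomes one /32 entry; the function name `routing_table_from_trace` and the /32 choice are assumptions, not a description of our tooling.

```python
import ipaddress


def routing_table_from_trace(dest_addrs):
    """Return one /32 route per unique destination address, sorted by address.

    `dest_addrs` is any iterable of dotted-quad strings (assumed input format).
    """
    unique = {ipaddress.IPv4Address(addr) for addr in dest_addrs}
    return [ipaddress.ip_network(f"{addr}/32") for addr in sorted(unique)]


if __name__ == "__main__":
    trace = ["10.0.0.2", "10.0.0.1", "10.0.0.2", "192.168.1.7"]
    for route in routing_table_from_trace(trace):
        print(route)
```

The number of entries produced this way corresponds to the "# of IPs" column of Table 2.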



Trace                     # of packets   # of IPs 
APTH: AMP-1110523221-1    769100         17628 
IPLS: I2A-1091235138-1    1821364        15791 



The major parameters of the architecture include the input 
width, i.e., the number of parallel inputs, denoted P; the 
pipeline depth, denoted H; the queue size, i.e., the maximum 
number of packets allowed to be stored in a queue, denoted Q; 
and the cache size, i.e., the maximum number of packets 
allowed to be cached, denoted C. In these experiments, the 
default setting for the architecture parameters was P = 4, 
H = 25, Q = 2, C = 160. 

The performance metric is the throughput in terms of the 
number of packets processed per clock cycle (PPC). Note 
that in a P-width architecture, the throughput is at most P PPC. 

5.2.1 Impact of the input width 

We increased the input width and observed the throughput 
scalability. Figure 10 shows that, with caching, the throughput 
scaled well with the input width, especially when P < 4. 

Figure 8: Bidirectional fine-grained mapping with various inversion factors. (Largest leaf heuristic) 



5.2.2 Impact of the cache size and the queue size 

We evaluated the impact of the cache size and the queue 
size, respectively, on the throughput. The results are shown 
in Figures 11 and 12. Caching is efficient in improving the 
throughput. With only 1% of the routing entries being 
cached, the throughput reached almost 4 PPC in a 4-width 
architecture. On the other hand, the queue size had little 
effect on the throughput improvement. A small queue with 
Q = 16 is enough for the 4-width architecture. 
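
The benefit of caching comes from traffic locality, which can be estimated offline from a trace. The sketch below measures the hit rate of a small cache over a sequence of destination addresses. The replacement policy of the architecture is not specified in this section, so LRU is used here purely as an assumption, and `lru_hit_rate` is a hypothetical helper rather than part of our simulator.

```python
from collections import OrderedDict


def lru_hit_rate(dest_addrs, cache_size):
    """Fraction of lookups served by an LRU cache of `cache_size` entries.

    LRU is an assumed policy chosen for illustration only.
    """
    cache = OrderedDict()
    hits = 0
    total = 0
    for addr in dest_addrs:
        total += 1
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


if __name__ == "__main__":
    trace = ["a", "a", "b", "a", "c", "b"]
    print(lru_hit_rate(trace, cache_size=2))
```

Running such a function over the traces of Table 2 with cache sizes around C = 160 would give a rough, policy-dependent indication of how many lookups bypass the pipeline.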

5.3 Overall Performance 

Based on the previous experiments, we estimate the overall 
performance of a 4-width 25-stage architecture. As Figure 8(b) 
shows, for the largest backbone routing table rrc11 with 
154419 prefixes, each stage has fewer than 32K nodes. 
A 15-bit address is enough to index a node in the local 
memory of a stage. Since the pipeline depth is 25, we need 
an extra 5 bits to specify the distance. Thus, each node 
stored in the local memory needs 20 bits. The total memory 
needed for storing rrc11 in a 25-stage architecture is 
20 × 2^15 × 25 = 16 Mb = 2 MB, where each stage needs 80 KB 
of memory. We use CACTI 4.2 [4] to estimate the memory 
access time and the power consumption. An 80 KB dual-port 
SRAM using 45 nm technology needs 0.53 ns to access and 
dissipates 0.01 W of power. The maximum clock rate of 
the above architecture in ASIC implementation can be 1.87 
GHz. Considering the throughput of 4 PPC as shown in 
Figure 10, the overall throughput can be as high as 4 × 1.87 
= 7.5 billion packets per second, i.e., 2.4 Tbps for the minimum 
packet size of 40 bytes. Such a throughput is 14 times that 
of the state-of-the-art TCAM-based IP lookup engines [24]. 
The overall power consumption is 0.25 W, which is only one 
eighth of that of the "coolest" TCAM solution [23]. 
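
The arithmetic behind these estimates can be checked with a short script. It only restates the figures quoted above; the constant and function names are hypothetical, and the per-stage power and access time are the CACTI estimates as reported, not values computed here.

```python
NODES_PER_STAGE = 2 ** 15   # fewer than 32K nodes per stage
ADDRESS_BITS = 15           # indexes a node within one stage
DISTANCE_BITS = 5           # enough for a pipeline depth of 25
STAGES = 25
PPC = 4                     # packets per clock cycle (4-width, with caching)
CLOCK_GHZ = 1.87            # reported maximum clock rate
ACCESS_NS = 0.53            # reported SRAM access time
POWER_PER_STAGE_W = 0.01    # reported power per 80 KB SRAM
MIN_PACKET_BYTES = 40


def overall_estimates():
    """Recompute the memory, throughput, and power figures of this section."""
    bits_per_node = ADDRESS_BITS + DISTANCE_BITS              # 20 bits
    bits_per_stage = bits_per_node * NODES_PER_STAGE
    packets_per_second = PPC * CLOCK_GHZ * 1e9
    return {
        "stage_kb": bits_per_stage / 8 / 1024,                # 80.0 KB
        "total_mbytes": bits_per_stage * STAGES / 8 / 2 ** 20,  # about 1.95
        "gpps": packets_per_second / 1e9,                     # about 7.48
        "tbps": packets_per_second * MIN_PACKET_BYTES * 8 / 1e12,  # about 2.39
        "power_w": STAGES * POWER_PER_STAGE_W,                # about 0.25 W
        "access_bound_ghz": 1 / ACCESS_NS,                    # about 1.89 GHz
    }


if __name__ == "__main__":
    for name, value in overall_estimates().items():
        print(f"{name}: {value:.4f}")
```

The 2 MB and 2.4 Tbps figures are therefore rounded values (about 1.95 MB and 2.39 Tbps), and the 1.87 GHz clock lies just below the roughly 1.89 GHz bound implied by a 0.53 ns access time.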

6. CONCLUSIONS AND FUTURE WORK 

This paper proposed a flexible dual-port SRAM-based 
bidirectional linear pipeline architecture for scalable IP lookup 
and packet classification in IP routers. By using a bidirectional 
fine-grained mapping scheme, the tree nodes can be evenly 
distributed over the pipeline stages. Due to its linear 
structure, the architecture preserves the packet input order 
and supports non-blocking route updates. Using 2 MB of 
memory to store a core routing table with over 150K entries, 
the architecture can sustain a high throughput of 0.6 Tbps 
and can further achieve 2.4 Tbps by employing caching. 

For multi-dimensional packet classification, the operations 
in each stage are more complex than for the simple trie-based 
IP lookup. This may adversely affect the pipeline performance. 
We plan to develop new search data structures for packet 
classification so that pipelining can be more feasible. 

Figure 12: Throughput vs. Queue size. (P = 4, H = 25, C = 160.) 